Automatic Table Ground Truth Generation and a Background-Analysis-Based Table Structure Extraction Method

Authors

  • Yalin Wang
  • Robert M. Haralick
  • Ihsin T. Phillips
Abstract

In this paper, we first describe an automatic table ground truth generation system that can efficiently generate a large amount of accurate table ground truth suitable for developing table detection algorithms. We then describe a novel background-analysis-based, coarse-to-fine table identification algorithm and an X-Y cut table decomposition algorithm. We also discuss an experimental protocol for evaluating table detection algorithms. On a total of 1,125 document pages containing 518 table entities and 10,941 cell entities, our table detection algorithm takes line and word segmentation results as input and achieves a correct cell detection rate of around 90%.
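The abstract names an X-Y cut decomposition but does not spell it out. As a rough illustration of the general technique only (not the authors' implementation), the sketch below recursively splits word bounding boxes wherever the projection onto one axis shows an empty background gap. The `Box` layout, the function names `split_by_gaps` and `xy_cut`, and the gap thresholds are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical word bounding box: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


def split_by_gaps(boxes: List[Box], axis: int, min_gap: int) -> List[List[Box]]:
    """Group boxes separated by background gaps of at least min_gap.

    axis=0 projects onto x (vertical cuts between columns);
    axis=1 projects onto y (horizontal cuts between rows).
    """
    if not boxes:
        return []
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # farthest extent covered so far
    for box in ordered[1:]:
        if box[axis] - reach >= min_gap:
            groups.append([box])  # empty background gap: cut here
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[Box],
           min_gap: Tuple[int, int] = (10, 5),  # assumed (x gap, y gap) thresholds
           axis: int = 1) -> List[List[Box]]:
    """Recursively alternate horizontal and vertical cuts; return leaf groups."""
    parts = split_by_gaps(boxes, axis, min_gap[axis])
    if len(parts) <= 1:
        # No cut along this axis: try the other one before giving up.
        axis = 1 - axis
        parts = split_by_gaps(boxes, axis, min_gap[axis])
        if len(parts) <= 1:
            return [boxes] if boxes else []
    leaves: List[List[Box]] = []
    for part in parts:
        leaves.extend(xy_cut(part, min_gap, 1 - axis))
    return leaves


# A 2 x 2 grid of single-word cells.
grid = [(0, 0, 20, 10), (40, 0, 60, 10), (0, 30, 20, 40), (40, 30, 60, 40)]
cells = xy_cut(grid)
```

Each leaf group returned by `xy_cut` plays the role of one candidate cell. In practice the thresholds would have to be tuned to the document's font size and spacing, and words separated by less than the x threshold stay in the same cell.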

Similar resources

Random Table and Its Ground Truth Automatic Generation: A Tool for Table Understanding Research

We developed a software tool to assist table understanding research. It can analyze any given table ground truth and generate documents that include similar table elements while having more variety in both the table and non-table parts. Based on our novel content-matching ground-truthing idea, the table ground truth data for the generated table elements becomes available with little manual work. The v...

Table structure understanding and its performance evaluation

With the large number of existing documents and the increasing speed in the production of new documents, finding efficient methods to process these documents for their content retrieval and storage becomes critical. Tables are a popular and efficient document element type. Therefore, table structure understanding is an important problem in the document layout analysis field. This paper presents...

Document Layout Structure Extraction Using Bounding Boxes of Different Entities

This paper presents an efficient and accurate technique for document page layout structure extraction and classification by analyzing the spatial configuration of the bounding boxes of different entities on a given image. The text, table, and nontext structures are detected on document images. The text-lines and words are extracted and the tabular structure is further decomposed into row and column ...

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...

Table Metadata: Headers, Augmentations and Aggregates

A sample of 200 web tables was interactively converted into layout-independent Augmented Wang Notation (AWN) using the Table Abstraction Tool (TAT). The resulting XML ground-truth files list for each table (1) cell contents, (2) relationships between the hierarchical column and row headers and the value/content/data cells, (3) designators for aggregates like totals and averages, and (4) ancilla...

Publication date: 2001